Armance Larfeuille, Katia Voltz, Manoël Pidoux, Elodie Shoeiline Kwan and Nina Bidet

Teacher : O. Boldi Spring Semester 2022

Introduction

We are doing this report as part of our “Machine Learning” course, in which we have to produce a detailed report based on an original database. To do this, we will re-analyze an existing data set by adding new elements (new models, features, outcome labels, approaches, interpretations, etc.). Our goal is to classify mountains according to different features. To do this, we will go through supervised and unsupervised learning methods.

Overview and motivation

After several searches, we found a very interesting article explaining how climatic, soil and biotic factors interact to shape the different facets of Alpine plant diversity. The scientists collected these data on three mountains in different locations and with contrasting characteristics.

  1. ‘Guadarrama - Sierra de Guadarrama’ in Spain with a Mediterranean climate
  2. ‘Ordesa ~ Monte Perdido - Central Pyrenees’ in Spain with a Temperate climate
  3. ‘Central Andes’ in Chile with a Mediterranean climate

The database provided with the article was detailed, interesting and included enough observations to conduct our project. An analysis has already been carried out on this data set and we will complete it by using the skills acquired during our semester of course. Indeed, we will add machine learning methods of classification and regression.

Project objectives

Our goal is to use the relevant machine learning tools, with supervised and unsupervised learning methods, to characterize our data frame. In our case, the question is to determine which mountain a sample comes from based on the given soil characteristics.

Structure of the report

Introduction
Part 1 : Data preparation
Part 2 : Exploratory Data Analysis (EDA)
Part 3 : Analysis
  - Supervised learning analysis
  - Unsupervised learning analysis
Conclusion

Part 1 : Data Preparation

In this first part, we will first download the data into R. Then, we will apply the necessary transformations on these data so that they can be correctly exploited for the following analysis.

Import dataset

Sample_ID Country Muntain Range Locality Plot Plot_ID Subplot Date Day Month Year
1 Spain Sierra de Guadarrama Asomate-Hoyos 1 1 1 2010-07-21 21 7 2010
2 Spain Sierra de Guadarrama Asomate-Hoyos 1 1 2 2010-07-21 21 7 2010
3 Spain Sierra de Guadarrama Asomate-Hoyos 1 1 3 2010-07-21 21 7 2010
4 Spain Sierra de Guadarrama Asomate-Hoyos 1 1 4 2010-07-21 21 7 2010
5 Spain Sierra de Guadarrama Asomate-Hoyos 1 1 5 2010-07-21 21 7 2010
6 Spain Sierra de Guadarrama Bola del Mundo 2 2 1 2010-06-22 22 6 2010
Grid zone UTM_X UTM_Y Elevation Orientation Slope Radiation Phos_P Glu_P SOC_P NT_P
30 T 425862.7 4517724 2225 175 4 0.8088464 4.437515 2.505851 6.315554 4.070239
30 T 425862.7 4517724 2225 175 4 0.8088464 4.437515 2.505851 6.315554 4.070239
30 T 425862.7 4517724 2225 175 4 0.8088464 4.437515 2.505851 6.315554 4.070239
30 T 425862.7 4517724 2225 175 4 0.8088464 4.437515 2.505851 6.315554 4.070239
30 T 425862.7 4517724 2225 175 4 0.8088464 4.437515 2.505851 6.315554 4.070239
30 T 417601.1 4515745 2242 182 2 0.7879971 5.171000 3.234583 5.092630 4.546233
PT_P K_P pH_P Cond_P Phos_B Glu_B SOC_B NT_B PT_B K_B
0.456501 0.0086492 4.925 31.284 2.888396 1.691185 3.762122 3.297689 0.4450694 0.0025165
0.456501 0.0086492 4.925 31.284 2.888396 1.691185 3.762122 3.297689 0.4450694 0.0025165
0.456501 0.0086492 4.925 31.284 2.888396 1.691185 3.762122 3.297689 0.4450694 0.0025165
0.456501 0.0086492 4.925 31.284 2.888396 1.691185 3.762122 3.297689 0.4450694 0.0025165
0.456501 0.0086492 4.925 31.284 2.888396 1.691185 3.762122 3.297689 0.4450694 0.0025165
3.923043 0.0133951 5.232 48.020 3.102327 1.955476 2.993161 3.168254 3.1682544 0.0067595
pH_B Cond_B Phos_T Glu_T SOC_T NT_T PT_T K_T pH_T Cond_T
5.402 22.198 3.593446 2.203418 5.736881 3.958871 0.4871885 0.0059856 5.214310 27.60922
5.402 22.198 3.593446 2.203418 5.736881 3.958871 0.4871885 0.0059856 5.214310 27.60922
5.402 22.198 3.593446 2.203418 5.736881 3.958871 0.4871885 0.0059856 5.214310 27.60922
5.402 22.198 3.593446 2.203418 5.736881 3.958871 0.4871885 0.0059856 5.214310 27.60922
5.402 22.198 3.593446 2.203418 5.736881 3.958871 0.4871885 0.0059856 5.214310 27.60922
5.760 28.700 5.607705 3.481617 5.288592 4.912220 4.8853870 0.0153168 5.505655 48.93783

We delete the variables that are not useful for us. Variables such as the UTM grid zone, UTM_X and UTM_Y describe where the samples were taken, and we do not want to make predictions based on those characteristics but rather on the actual soil composition. The variables Elevation, Slope, Plot_ID and Orientation will not help us either, because they also relate to the place where the sample was taken (all samples were collected a certain height above the tree line). We rename the variable Muntain Range as Mountain_range to facilitate future use.

After these changes, the database Mountain_data is composed of 430 observations and 34 variables.

  • Country : chr Spain, Chile
  • Mountain_range : chr Sierra de Guadarrama, Central Pyrenees
  • Locality : chr Locality of the sample (there are different locations on the same mountain)
  • Plot : num In each site, a sampling plot of 20 × 20 m was established in a relatively homogeneous vegetation area.
  • Subplot : num There are 5 subplots in each plot
  • Date : Date Date of the sample
  • Day : num Day of the sample
  • Month : num Month of the sample
  • Year : num Year of the sample
  • Radiation : num Potential solar radiation (0-1)
  • Phos_P : num Phosphatase enzyme. Measured on dry soil sampled under plant canopy (µmol/gr dry soil/h)
  • Glu_P : num β-glucosidase enzyme. Measured on dry soil sampled under plant canopy (µmol/gr dry soil/h)
  • SOC_P : num soil organic carbon. Measured on dry soil sampled under plant canopy (%)
  • NT_P : num soil total nitrogen. Measured on dry soil sampled under plant canopy (mg/g soil)
  • PT_P : num available phosphorus. Measured on dry soil sampled under plant canopy (mg/g soil)
  • K_P : num potassium content. Measured on dry soil sampled under plant canopy (mg/g soil)
  • pH_P : num pH. Measured on dry soil sampled under plant canopy
  • Cond_P : num electrical conductivity. Measured on dry soil sampled under plant canopy (µS/cm)
  • Phos_B : num Phosphatase enzyme. Measured on dry soil sampled from bare soil areas (µmol/gr dry soil/h)
  • Glu_B : num β-glucosidase enzyme. Measured on dry soil sampled from bare soil areas (µmol/gr dry soil/h)
  • SOC_B : num soil organic carbon. Measured on dry soil sampled from bare soil areas (%)
  • NT_B : num soil total nitrogen. Measured on dry soil sampled from bare soil areas (mg/g soil)
  • PT_B : num available phosphorus. Measured on dry soil sampled from bare soil areas (mg/g soil)
  • K_B : num potassium content. Measured on dry soil sampled from bare soil areas (mg/g soil)
  • pH_B : num pH. Measured on dry soil sampled from bare soil areas
  • Cond_B : num electrical conductivity. Measured on dry soil sampled from bare soil areas (µS/cm)
  • Phos_T : num Phosphatase enzyme. Values averaged and weighted by the respective cover of bare ground and vegetated area in each site (µmol/gr dry soil/h)
  • Glu_T : num β-glucosidase enzyme. Values averaged and weighted by the respective cover of bare ground and vegetated area in each site (µmol/gr dry soil/h)
  • SOC_T : num soil organic carbon. Values averaged and weighted by the respective cover of bare ground and vegetated area in each site (%)
  • NT_T : num soil total nitrogen. Values averaged and weighted by the respective cover of bare ground and vegetated area in each site (mg/g soil)
  • PT_T : num available phosphorus. Values averaged and weighted by the respective cover of bare ground and vegetated area in each site (mg/g soil)
  • K_T : num potassium content. Values averaged and weighted by the respective cover of bare ground and vegetated area in each site (mg/g soil)
  • pH_T : num pH. Values averaged and weighted by the respective cover of bare ground and vegetated area in each site
  • Cond_T : num electrical conductivity. Values averaged and weighted by the respective cover of bare ground and vegetated area in each site (µS/cm)

Now that we have the variables we want to work with, we see that they are all considered numeric. However, we know that some of them should be categorical, as they have different levels. We are going to transform Country, Mountain_range, Locality, Plot and Subplot into factors.
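A minimal sketch of this conversion (assuming the data frame is still named Mountain_data, as in the import step):

```r
# Convert the grouping variables to factors; the column names below
# match the variable list given earlier in this part
cat_vars <- c("Country", "Mountain_range", "Locality", "Plot", "Subplot")
Mountain_data[cat_vars] <- lapply(Mountain_data[cat_vars], as.factor)

str(Mountain_data[cat_vars])   # check the levels of each new factor
```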

We can export this cleaned dataset to the data folder of our project.

write.csv(Mountain_data,"../data/Mountain_data_cleaned.csv", row.names = FALSE)




For a better representation of the data, we created 3 maps, with each icon representing a subsample belonging to a larger sample. Each map represents a mountain region: two in Spain and one in Chile.

Sierra de Guadarrama (Spain)

Mediterranean climate



Central Pyrenees (Spain)

Temperate climate



Chilean Andes (Chile)

Mediterranean climate

Part 2 : Exploratory Data Analysis



This section is dedicated to understanding the data. We will provide an analysis of the data set using a visual approach in order to summarize their main characteristics.

Let’s import the cleaned dataset that we created.

Mountain_data_cleaned <- read.csv("../data/Mountain_data_cleaned.csv")


We will analyze some basic statistical elements for each variable. To do this we need to transform the variable Date to date format.

Mountain_data_cleaned$Date <- as.Date(Mountain_data_cleaned$Date)


We first look at the general aspect of the data set and try to discover if there are any missing values.

rows columns all_missing_columns total_missing_values complete_rows total_observations
430 34 0 2 428 14620


There are 2 missing values in the feature Glu_P. One is at instance number 377 and the other at instance 378.

Country Mountain_range Locality Plot Subplot Date Glu_P
377 Chile Central Andes Baños de Colinas 76 2 2014-01-21 NA
378 Chile Central Andes Baños de Colinas 76 3 2014-01-21 NA
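One way to locate these missing values is a sketch along these lines (object name as imported above):

```r
# Row/column indices of every NA in the data frame
na_positions <- which(is.na(Mountain_data_cleaned), arr.ind = TRUE)
na_positions

# Confirm that both NAs sit in the Glu_P column
colSums(is.na(Mountain_data_cleaned))
```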


To understand the distribution of our data in the data set we use the following graph:


All the variables that we will train our models on come from columns of continuous values, which explains their predominance. We can also notice that the missing observations represent only 0.014% of the total number of observations, which at first sight makes it a good data set.

We take our search for anomalies further by exploring the characteristics of each variable. For the readability of the report, we show only a few variables.

## Country 
##        n  missing distinct 
##      430        0        2 
##                       
## Value      Chile Spain
## Frequency    100   330
## Proportion 0.233 0.767
## Mountain_range 
##        n  missing distinct 
##      430        0        3 
##                                                                          
## Value             Central Andes     Central Pyrenees Sierra de Guadarrama
## Frequency                   100                  135                  195
## Proportion                0.233                0.314                0.453
## Phos_P 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      430        0      268        1    3.477     2.26   0.6777   1.0753 
##      .25      .50      .75      .90      .95 
##   1.9318   3.1503   4.7146   6.2213   7.1411 
## 
## lowest : 0.01980997 0.16225689 0.25041658 0.31268048 0.32430518
## highest: 7.73917664 7.90167428 8.05414937 8.64896403 8.64973041
## Glu_P 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      428        2      272        1    2.102     1.24   0.2986   0.5062 
##      .25      .50      .75      .90      .95 
##   1.3679   2.0922   2.7710   3.2561   3.9541 
## 
## lowest : 0.1074305 0.1110000 0.1140850 0.1150734 0.1619782
## highest: 4.5538748 4.7529319 5.2224173 5.5441612 6.3505287
## NT_P 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      430        0      273        1    3.971    2.688   0.4501   0.7760 
##      .25      .50      .75      .90      .95 
##   2.1575   3.9155   5.6767   7.2597   8.0787 
## 
## lowest :  0.1156901  0.1610857  0.1833210  0.2201596  0.2365415
## highest:  8.4518587  8.9110000  9.0285000 10.8160000 18.0010000

We can observe that there is a big difference between the total number of observations and the number of distinct observations for the variables related to the chemical elements. We will try to understand where this difference comes from in the visual analysis part.


Data visualization : Plotting the data



We plot the numerical variables.


Many variables appear with a distribution close to log-normal.

  • The variables pH_B and pH_T are not normally distributed.
  • Most of the variables are right-skewed.



We will then use box-plots to detect outliers in the numerical variables and compare the distributions across the mountain classes.


We can see the real differences between the mountains.

  • Generally speaking, it seems that the mountain “Sierra de Guadarrama” has a higher value of Glu_P, Phos_P, SOC_P, Glu_B, Phos_B, SOC_B, Glu_T, Phos_T and SOC_T.
  • There are a lot of outliers in almost all the features.



We then plot the categorical variables:


We see that more observations come from Spain, which is normal since two out of three mountains are located in Spain. The localities where the samples were taken are almost all composed of a sample of 5 subsamples. Some localities, perhaps more interesting for the study, were sampled several times, but always by a multiple of 5 subsamples.


We have more observations for the mountain “Sierra de Guadarrama” (195) compared to “Central Andes” (100) and “Central Pyrenees” (135). The differences between these counts are big enough that we should be careful with the results and consider balancing the data if needed.

As described above with the summary output of the data, we see that we have more information on the mountain “Sierra de Guadarrama”: about twice as much as on the mountain “Central Andes”. Our final result might be adversely affected, because the model will tend to reach a good accuracy simply by predicting “Sierra de Guadarrama” more often, while remaining poor at predicting new instances. If needed, we can also focus on per-class sensitivity and specificity rather than overall accuracy.

We will have to see if we will need to balance our data to get a better model.

We will also inspect possible duplicate observations; indeed, as previously found, some variables do not have all of their observations distinct.
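A sketch of this duplicate check (the selection of the numerical measurement columns is an assumption):

```r
# Count rows whose soil measurements repeat an earlier row, then drop them
soil_vars <- Mountain_data_cleaned[sapply(Mountain_data_cleaned, is.numeric)]
sum(duplicated(soil_vars))                               # number of repeated rows
Mountain_no_dup <- Mountain_data_cleaned[!duplicated(soil_vars), ]
nrow(Mountain_no_dup)                                    # size of the reduced data set
```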


We immediately notice how few distinct values there are among the samples of Sierra de Guadarrama: removing duplicates leaves us with a data set of only 274 observations. We will therefore first implement our models keeping the duplicates, knowing that identical values in the train set and the test set will inflate the measured accuracy of the model. Then we will test our models again with the reduced data set to observe whether there is a loss of accuracy.

For the rest of the EDA we will continue the analysis on the complete data set.


From the correlation plot, it seems that some patterns can be observed. The variables concerning the phosphatase enzyme seem to be positively correlated with the variables about soil organic carbon.


With this plot, we see indeed that the soil organic carbon and phosphatase enzyme families are significantly positively correlated, with correlation coefficients going from 0.739 (SOC_B - Phos_P) to 0.947 (SOC_T - SOC_P).
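The correlation matrix behind these plots can be computed along these lines (a sketch; the corrplot package is one option for the visualisation):

```r
library(corrplot)

# Correlation matrix of the numerical soil variables;
# pairwise.complete.obs handles the 2 NAs in Glu_P
num_vars <- Mountain_data_cleaned[sapply(Mountain_data_cleaned, is.numeric)]
corr_mat <- cor(num_vars, use = "pairwise.complete.obs")
corrplot(corr_mat, method = "color", type = "upper", tl.cex = 0.6)
```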

Principal Component Analysis (PCA) exploration:


  • This analysis helps us to understand the link between the explanatory variables.


The first step is to analyse the data in the covariance matrix, as we did before, where we found the positive correlation between the soil organic carbon and the phosphatase enzyme.
The second step is to group the data into principal components.
The third step is to produce a variable factor map to better understand the role of each factor.


Second step - Principal Component Result:

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.3080 2.2090 1.6387 1.26851 1.02032 0.97230 0.82856
## Proportion of Variance 0.4377 0.1952 0.1074 0.06436 0.04164 0.03781 0.02746
## Cumulative Proportion  0.4377 0.6329 0.7403 0.80469 0.84633 0.88414 0.91161
##                           PC8     PC9    PC10   PC11    PC12    PC13    PC14
## Standard deviation     0.7106 0.63948 0.58236 0.5050 0.41430 0.37199 0.32284
## Proportion of Variance 0.0202 0.01636 0.01357 0.0102 0.00687 0.00554 0.00417
## Cumulative Proportion  0.9318 0.94816 0.96173 0.9719 0.97879 0.98433 0.98850
##                           PC15    PC16    PC17    PC18    PC19   PC20    PC21
## Standard deviation     0.25571 0.23393 0.21525 0.17576 0.17150 0.1415 0.12739
## Proportion of Variance 0.00262 0.00219 0.00185 0.00124 0.00118 0.0008 0.00065
## Cumulative Proportion  0.99111 0.99330 0.99515 0.99639 0.99757 0.9984 0.99902
##                          PC22    PC23    PC24    PC25
## Standard deviation     0.1115 0.08012 0.05935 0.04711
## Proportion of Variance 0.0005 0.00026 0.00014 0.00009
## Cumulative Proportion  0.9995 0.99977 0.99991 1.00000
  • We see that the first component explains 43.77% of the overall variation.
  • The second component explains a further 19.52%.
  • With a reduction to 4 principal components, we obtain a cumulative variance of 80.5%, superior to the threshold of 75%.
  • The remaining components (Components 5 to 25) explain 19.5% overall.

Here, as the command prcomp does not allow NAs in the data, we use the command na.omit on our reduced data frame containing the numerical values to omit all NA cases.
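A minimal sketch of this step (object names assumed):

```r
# PCA on the centred and scaled numerical variables,
# after dropping the rows with NAs
num_vars <- na.omit(Mountain_data_cleaned[sapply(Mountain_data_cleaned, is.numeric)])
pca_res  <- prcomp(num_vars, center = TRUE, scale. = TRUE)

summary(pca_res)   # standard deviation and proportion of variance per component
```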

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 430 individuals, described by 25 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

For the further analysis, we can study as well the eigenvalues in order to select a good number of components.

Eigenvalue analysis:

##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  10.936952114     43.747808455                    43.74781
## Dim.2   4.863450112     19.453800446                    63.20161
## Dim.3   2.694400932     10.777603728                    73.97921
## Dim.4   1.602728782      6.410915127                    80.39013
## Dim.5   1.045432762      4.181731049                    84.57186
## Dim.6   0.954822102      3.819288408                    88.39115
## Dim.7   0.688518236      2.754072944                    91.14522
## Dim.8   0.503178240      2.012712959                    93.15793
## Dim.9   0.411065663      1.644262652                    94.80220
## Dim.10  0.336691388      1.346765554                    96.14896
## Dim.11  0.254731126      1.018924506                    97.16789
## Dim.12  0.172594714      0.690378855                    97.85826
## Dim.13  0.138243331      0.552973325                    98.41124
## Dim.14  0.105400519      0.421602078                    98.83284
## Dim.15  0.067346066      0.269384264                    99.10222
## Dim.16  0.054954619      0.219818475                    99.32204
## Dim.17  0.046558386      0.186233542                    99.50828
## Dim.18  0.031588992      0.126355968                    99.63463
## Dim.19  0.029601829      0.118407318                    99.75304
## Dim.20  0.020191077      0.080764308                    99.83380
## Dim.21  0.016267999      0.065071996                    99.89888
## Dim.22  0.012623924      0.050495696                    99.94937
## Dim.23  0.006886114      0.027544454                    99.97692
## Dim.24  0.003535893      0.014143571                    99.99106
## Dim.25  0.002235080      0.008940321                   100.00000

We obtain the cumulative variance, as before, and also the eigenvalues.

  • A first rule of thumb is to stop adding components when the total variance explained exceeds a high value, for example 80%.
  • Another rule is the Kaiser-Guttman rule, which states that components with an eigenvalue greater than 1 should be retained. The reason is that with p variables the sum of the eigenvalues is p, so a value above 1 is above average.


Therefore, we can consider dimensions 1 to 5:
Cumulative variance: 84.57%
Eigenvalue > 1 for each of them
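The Kaiser-Guttman selection can be automated along these lines (a sketch, assuming a PCA result object named pca_res and the factoextra package):

```r
library(factoextra)

# Eigenvalues and cumulative variance of the PCA result
eig <- get_eigenvalue(pca_res)

# Kaiser-Guttman rule: keep the components with eigenvalue > 1
n_keep <- sum(eig$eigenvalue > 1)
n_keep
eig[1:n_keep, "cumulative.variance.percent"]
```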

Screeplot of the eigenvalues.


  • A screeplot represents the value of each eigenvalue.
  • According to the Kaiser-Guttman rule, we should stop at Component 5.



Third step - Variable Factor Map:

The variable factor map shows the variables and organizes them along the dimensions. Here the first two dimensions are represented.



Other representation:


Dimension 1 (x-axis) is highly correlated to Phos_T, Phos_B, Phos_P and Glu_T.
Dimension 1 is moderately correlated to PT_B.
Dimension 1 is poorly correlated to Cond_T and Cond_B.
Dimension 2 is well correlated to Cond_T and Cond_B.
Dimension 2 is also moderately negatively correlated to Radiation.
It seems that we have 4 groups of variables playing different roles. On these two dimensions, we notice that the mountain classes already separate into 3 distinct clusters.

  • Indeed, Sierra de Guadarrama is more positively correlated to Dim 1.
  • Central Andes is poorly negatively correlated to Dim 1, but more negatively correlated to Dim 2 (negatively with Cond_T).
  • Central Pyrenees is negatively correlated to Dim 1, and positively correlated to Dim 2.


Study of each variable according to the 5 dimensions:



Dim 1: Highly correlated to the Phos, SOC and Glu families
Dim 2: Correlated with Cond_P and Cond_T
Dim 3: Correlated with PT_P, PT_B and PT_T
Dim 4: Moderately correlated to K_B and K_T
Dim 5: Correlated to Radiation

Put into representation the 5 dimensions:


The square cosine shows the importance of a component for a given observation. It is therefore normal that observations close to the origin are less significant than those far from it. Here we decided to represent only one variable of each type since the same chemical elements tend to have the same behavior independently of their sampling method. A variable that has an interesting behavior is Radiation, indeed the more we select high dimensions the more this variable becomes important (except for dimension 4), while the variables related to chemical elements tend to decrease. Thus, we find radiation strongly correlated with dimension 5.

Analysis in 3D

As seen in the EDA, we can consider 5 dimensions. In the following graph, we project the 3 mountains onto the first 3 dimensions. Clusters may be apparent.



Through this 3D representation, we can observe the distribution of the 3 mountains in the PCA. ‘Sierra de Guadarrama’ shows a high correlation to Dim 1. Further in the analysis we will perform a cluster analysis to better understand the apparent separation between the mountains.

Part 3 : Analysis

Import dataset

Let’s import the cleaned dataset that we created.

Mountain_data_cleaned <- read.csv("../data/Mountain_data_cleaned.csv")

Replace the NAs by the mean of the closest observations

Some of the models we use do not work with NAs. To deal with them, we decided to replace the 2 NAs we have by the mean of their closest observations.
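One simple way to do this, sketched here for the two adjacent NAs in Glu_P (row indices taken from the EDA):

```r
# Replace each NA in Glu_P by the mean of its neighbouring observations.
# Rows 377 and 378 are adjacent, so na.rm = TRUE skips the neighbour
# that is still NA on the first pass of the loop.
na_idx <- which(is.na(Mountain_data_cleaned$Glu_P))
for (i in na_idx) {
  neighbours <- Mountain_data_cleaned$Glu_P[c(i - 1, i + 1)]
  Mountain_data_cleaned$Glu_P[i] <- mean(neighbours, na.rm = TRUE)
}
sum(is.na(Mountain_data_cleaned$Glu_P))   # should now be 0
```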

Splitting the data into Training set and Test set
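A sketch of a stratified split with caret (the 75/25 proportion is an assumption consistent with the test-set size used later in the report):

```r
library(caret)

# Stratified split: createDataPartition keeps the class shares of
# Mountain_range similar in the training and test sets
set.seed(1)
idx <- createDataPartition(Mountain_data_cleaned$Mountain_range,
                           p = 0.75, list = FALSE)
Mountain_data.tr <- Mountain_data_cleaned[idx, ]
Mountain_data.te <- Mountain_data_cleaned[-idx, ]
```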

Balancing the Training set

As discussed in the EDA part, we should balance our data because we do not have the same amount of information on each mountain: we have the most observations on Sierra de Guadarrama and about half as many on Central Andes.

## 
##        Central Andes     Central Pyrenees Sierra de Guadarrama 
##                   79                   79                   79
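These class counts can be obtained by down-sampling to the smallest class, for example with caret::downSample (a sketch; object names assumed):

```r
library(caret)

# Down-sample the training set so each mountain class has as many
# observations as the smallest class
set.seed(1)
Mountain_data.tr <- downSample(
  x = Mountain_data.tr[, names(Mountain_data.tr) != "Mountain_range"],
  y = as.factor(Mountain_data.tr$Mountain_range),
  yname = "Mountain_range")

table(Mountain_data.tr$Mountain_range)   # equal counts per class
```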

Neural Network Model

Simple hyperparameter tuning; this code takes time to run.

set.seed(1)
fitControl <- trainControl(method = "cv", 
                           number = 10)

nnetGrid <-  expand.grid(size = seq(from = 1, to = 6, by = 1),
                        decay = seq(from = 0.1, to = 0.5, by = 0.1))

nnetFit <- train(Mountain_range ~ ., 
                 data = Mountain_data.tr,
                 method = "nnet",
                 metric = "Accuracy",
                 tuneGrid = nnetGrid,
                 trControl = fitControl)

The best neural network parameters would be a size of 4 (i.e. 4 units in the single hidden layer that nnet fits), with a decay of 0.1.

The manually written Neural Network model
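A sketch of what this manual fit could look like with the tuned hyperparameters (object names assumed):

```r
library(nnet)

# Fit the final single-hidden-layer network with the tuned parameters
# (size = 4 hidden units, decay = 0.1) and evaluate it on the test set
set.seed(1)
nnet_fit <- nnet(Mountain_range ~ ., data = Mountain_data.tr,
                 size = 4, decay = 0.1, maxit = 100)

pred <- predict(nnet_fit, newdata = Mountain_data.te, type = "class")
table(Obs = Mountain_data.te$Mountain_range, Pred = pred)   # confusion table
mean(pred == Mountain_data.te$Mountain_range)               # test accuracy
```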

## # weights:  747
## initial  value 343.739521 
## iter  10 value 260.410032
## iter  20 value 257.569875
## iter  30 value 255.306821
## iter  40 value 230.131410
## iter  50 value 63.237367
## iter  60 value 49.787755
## iter  70 value 49.290640
## iter  80 value 33.426478
## iter  90 value 19.428864
## iter 100 value 15.167996
## final  value 15.167996 
## stopped after 100 iterations
##                       Pred
## Obs                    Central Andes Central Pyrenees Sierra de Guadarrama
##   Central Andes                   21                0                    0
##   Central Pyrenees                 1               28                    0
##   Sierra de Guadarrama             0                0                   58
## [1] 0.9907407

The output shows that the model reaches almost perfect accuracy (99%).

## Confusion Matrix and Statistics
## 
##                       Reference
## Prediction             Central Andes Central Pyrenees Sierra de Guadarrama
##   Central Andes                   21                1                    0
##   Central Pyrenees                 0               28                    0
##   Sierra de Guadarrama             0                0                   58
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9907          
##                  95% CI : (0.9495, 0.9998)
##     No Information Rate : 0.537           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9846          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Central Andes Class: Central Pyrenees
## Sensitivity                        1.0000                  0.9655
## Specificity                        0.9885                  1.0000
## Pos Pred Value                     0.9545                  1.0000
## Neg Pred Value                     1.0000                  0.9875
## Prevalence                         0.1944                  0.2685
## Detection Rate                     0.1944                  0.2593
## Detection Prevalence               0.2037                  0.2593
## Balanced Accuracy                  0.9943                  0.9828
##                      Class: Sierra de Guadarrama
## Sensitivity                                1.000
## Specificity                                1.000
## Pos Pred Value                             1.000
## Neg Pred Value                             1.000
## Prevalence                                 0.537
## Detection Rate                             0.537
## Detection Prevalence                       0.537
## Balanced Accuracy                          1.000

With this confusion matrix command we get more information on the model. As said before, the accuracy is very high (99%), and we also see balanced accuracies close to 1 (the maximum we can get), which means that our model does not suffer from the unbalanced data.

Random Forest
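A sketch of the random forest fit that produces this kind of importance table (object names assumed; importance = TRUE is needed for the permutation-based measures):

```r
library(randomForest)

# Random forest on the balanced training set; importance = TRUE adds the
# per-class and MeanDecreaseAccuracy columns to the importance table
set.seed(1)
rf_fit <- randomForest(Mountain_range ~ ., data = Mountain_data.tr,
                       importance = TRUE)

importance(rf_fit)                               # variable importance table
predte <- predict(rf_fit, newdata = Mountain_data.te)
mean(predte == Mountain_data.te$Mountain_range)  # test accuracy
```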

##           Central Andes Central Pyrenees Sierra de Guadarrama
## Radiation    3.13737888         2.303583            2.6745117
## Phos_P       3.16069796         5.801423            6.5140634
## Glu_P        1.60137498         1.398899            1.6249260
## SOC_P        4.01423103         4.188059            4.1334148
## NT_P         5.11591652         3.316876            3.7355699
## PT_P         2.95677261         4.311597            1.5562930
## K_P         -0.09619991         1.331178            3.4685762
## pH_P        15.05352536        13.020120           16.9881797
## Cond_P       4.60393107         5.155277            3.3775076
## Phos_B       4.27608539         6.442918            6.0405404
## Glu_B        2.26558677         1.799629            3.7644969
## SOC_B        7.40221740         2.381187            7.6427334
## NT_B         9.01514124         5.587019            7.8622947
## PT_B         3.33835698         1.857367            1.6481046
## K_B          3.42100795         2.820063            2.6648140
## pH_B        16.49696920        15.863898           14.1175546
## Cond_B      10.68856356        10.045402            4.9915864
## Phos_T       4.19140393         7.942887            7.4260798
## Glu_T        1.12669814         2.199217            2.6690536
## SOC_T        7.45558803         2.924291            6.6968826
## NT_T        10.34572042         6.735634            8.2957005
## PT_T         3.23013037         3.002285           -0.1739031
## K_T          2.05614399         2.730476            2.4998274
## pH_T        15.48328130        14.648225           13.9304162
## Cond_T       9.81844426        10.168778            4.8952827
##           MeanDecreaseAccuracy MeanDecreaseGini
## Radiation             4.116329        0.3343212
## Phos_P                7.057318        3.7516764
## Glu_P                 2.398319        0.1122100
## SOC_P                 5.824931        2.4331305
## NT_P                  5.711238        2.8770588
## PT_P                  4.676746        0.3779176
## K_P                   3.013458        0.3661828
## pH_P                 17.947560       26.0658672
## Cond_P                6.498981        2.7284880
## Phos_B                7.853608        4.2337452
## Glu_B                 4.371594        1.6683522
## SOC_B                 8.984614        6.3292244
## NT_B                  9.975941        8.5198157
## PT_B                  3.375704        0.2074085
## K_B                   4.424175        0.6111154
## pH_B                 18.786750       25.8592483
## Cond_B               11.249806        9.8605651
## Phos_T                8.760424        6.6811926
## Glu_T                 3.180503        0.9214452
## SOC_T                 8.161092        6.6922297
## NT_T                 10.338053       10.4797244
## PT_T                  3.983457        0.3729806
## K_T                   3.381527        0.3913723
## pH_T                 18.236854       27.1060341
## Cond_T               11.221627        8.3969217

From the variable importance we see that the variables pH_P, pH_T and pH_B are very important. We can therefore say that the pH in general is important.

##                       predtr
##                        Central Andes Central Pyrenees Sierra de Guadarrama
##   Central Andes                   79                0                    0
##   Central Pyrenees                 0               79                    0
##   Sierra de Guadarrama             0                0                   79
## [1] 1
##                       predte
##                        Central Andes Central Pyrenees Sierra de Guadarrama
##   Central Andes                   20                1                    0
##   Central Pyrenees                 0               29                    0
##   Sierra de Guadarrama             0                0                   58
## [1] 0.9907407

We see that the model has an accuracy of 99% when predicting new observations.
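The accuracy printed under each confusion table is simply the share of observations on the diagonal. A base-R sketch on a hypothetical table mirroring the test-set results above:

```r
# Hypothetical confusion table (rows = observed, columns = predicted)
conf <- matrix(c(20,  1,  0,
                  0, 29,  0,
                  0,  0, 58),
               nrow = 3, byrow = TRUE,
               dimnames = list(
                 Obs  = c("Central Andes", "Central Pyrenees", "Sierra de Guadarrama"),
                 Pred = c("Central Andes", "Central Pyrenees", "Sierra de Guadarrama")))

# Accuracy = correctly classified / total observations
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 7)  # 0.9907407, matching the output above
```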

Naive Bayes

PH

(Density plots by mountain range follow for: phosphatase enzyme, β-glucosidase enzyme, soil organic carbon, soil total nitrogen, electrical conductivity and radiation.)

The analysis of the “Naive Bayes” model comes down to analyzing the density plots of each variable by mountain range. These density plots confirm what we saw before: pH is a very good feature for classifying the mountains. Indeed, pH_T, pH_B and pH_P are intermediate in the Central Andes, lower in the Sierra de Guadarrama and higher in the Central Pyrenees.

In addition, observing other variables such as the phosphatase enzyme (Phos_P, Phos_B, Phos_T), the β-glucosidase enzyme (Glu_B, Glu_P, Glu_T), the soil organic carbon (SOC_T, SOC_B, SOC_P), the soil total nitrogen (NT_P, NT_B, NT_T), the electrical conductivity (Cond_P, Cond_B, Cond_T) and the radiation can give us an idea of which mountain we are on. Indeed, a radiation lower than 0.6 rather indicates the Central Pyrenees. The electrical conductivity allows us to distinguish the Central Andes from the Central Pyrenees, since it is higher for the latter. Looking at the soil total nitrogen, we see that it is lower for the Central Andes than for the other two. The soil organic carbon allows us to distinguish the Central Andes from the Sierra de Guadarrama, since it is higher for the latter. The β-glucosidase enzyme is present at approximately the same level in the Central Andes and the Central Pyrenees but reaches a higher value in the Sierra de Guadarrama; it is much the same for the phosphatase enzyme.

However, some variables do not help us determine the mountain at all. This is the case for phosphorus content (PT_P, PT_B, PT_T) and potassium content (K_P, K_B, K_T).
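Naive Bayes fits one conditional density per feature and per class, which is why the per-variable density plots are the natural way to read the model. A minimal sketch, assuming the e1071 package and using iris as a stand-in for the mountain data (object names hypothetical):

```r
library(e1071)

# Stand-in train/test split on iris
set.seed(1)
idx <- sample(nrow(iris), 0.7 * nrow(iris))
fit <- naiveBayes(Species ~ ., data = iris[idx, ])

# Per-class mean and sd of one feature: the numbers behind a density plot
fit$tables$Petal.Length

# Test-set accuracy of the fitted model
pred <- predict(fit, iris[-idx, ])
mean(pred == iris$Species[-idx])
```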

K-NN Model

We use a 2-NN model to predict the test set from the training set.

##                       Pred
## Obs                    Central Andes Central Pyrenees Sierra de Guadarrama
##   Central Andes                   21                0                    0
##   Central Pyrenees                 0               29                    0
##   Sierra de Guadarrama             0                0                   58
## [1] 1

With this KNN supervised method, the prediction on the test set gives an accuracy of 1. This looks perfect and attractive, but it is completely biased by the fact that a subplot can be identical to its plot. There is therefore an overlap between observations, and the computed distance is 0. What we need to do is build a dataset of unique observations with no overlap.
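The bias is easy to see: when a test observation is an exact copy of a training observation, its nearest-neighbour distance is zero, so the duplicate always wins the vote and the label is trivially correct. A small base-R illustration with a duplicated row:

```r
# Two training points and a test point that duplicates the first one
train <- matrix(c(1, 2,
                  5, 6), nrow = 2, byrow = TRUE)
test  <- c(1, 2)

# Euclidean distance from the test point to each training point
d <- sqrt(rowSums((train - matrix(test, nrow = 2, ncol = 2, byrow = TRUE))^2))
d  # the first distance is exactly 0, so 1-NN/2-NN cannot miss
```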


Analysis with unique dataset

We now want to replicate the above models with another training set and test set obtained through a bootstrapping method. To do so, we first delete all repeated observations from the original cleaned dataset.

Unique

We remove every replicated row with the unique() function.
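A minimal base-R illustration of the de-duplication step (the data-frame content here is made up):

```r
# Toy data frame with one fully duplicated row
df <- data.frame(Mountain_range = c("A", "A", "B"),
                 pH_T           = c(6.1, 6.1, 7.4))

# unique() keeps one copy of each fully identical row
df_unique <- unique(df)
nrow(df_unique)  # 2
```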

Count

ggplot(Mountain_df) +
  geom_bar(aes(x = Mountain_range))

Here we see the new unique dataset and notice that it is heavily unbalanced. The mountain range that previously had the most observations is now the one with the fewest, and therefore the least information.

table(Mountain_df$Mountain_range)
## 
##        Central Andes     Central Pyrenees Sierra de Guadarrama 
##                  100                  135                   39

Sierra de Guadarrama now has only 39 observations. The dataset is indeed unbalanced, and this is barely enough to build accurate classification models.

Split in a new training set and test set

Bootstrap

We can now proceed to bootstrapping with 100 replicates.

## [1] 1 2 3 4 5 5
## [1] 198 198 199 200 201 204
## [1] 205  28
## [1] 73 28
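One bootstrap replicate can be sketched in base R: the training indices are drawn with replacement, and the never-drawn (out-of-bag) rows serve as the test set. Object and data-frame names here are hypothetical:

```r
set.seed(1)
n      <- 274                          # rows in the unique dataset (100 + 135 + 39)
tr.idx <- sample(n, replace = TRUE)    # bootstrap draw: repeats are allowed
te.idx <- setdiff(seq_len(n), tr.idx)  # out-of-bag rows, never drawn

# df.boot.tr <- Mountain_df_unique[tr.idx, ]
# df.boot.te <- Mountain_df_unique[te.idx, ]
head(sort(tr.idx))  # repeated indices show up side by side, as in the output above
```

On average roughly 36.8% of the rows are never drawn, which is what makes the out-of-bag rows usable as a test set.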

We can now replicate the models.

Neural Network Model (with unique bootstrapped data)

Simple hyperparameter tuning; this code takes time to run.

set.seed(1)

nnetFit_boot <- train(Mountain_range ~ ., 
                 data = df.boot.tr,
                 method = "nnet",
                 metric = "Accuracy",
                 tuneGrid = nnetGrid,
                 trControl = fitControl)

The best neural network should have 4 hidden units and a decay of 0.1. It is the same as before.
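The nnetGrid and fitControl objects used in the train() call are defined earlier in the report and not shown in this excerpt. A plausible reconstruction, consistent with the reported optimum (size 4, decay 0.1) but purely an assumption, would be:

```r
library(caret)

# Hypothetical tuning grid: hidden-layer sizes and weight-decay values
nnetGrid <- expand.grid(size  = c(1, 2, 4, 8),
                        decay = c(0, 0.01, 0.1, 1))

# Hypothetical resampling scheme: repeated 10-fold cross-validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```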

## # weights:  375
## initial  value 284.692893 
## iter  10 value 116.079753
## iter  20 value 57.538311
## iter  30 value 29.136661
## iter  40 value 22.652969
## iter  50 value 21.197795
## iter  60 value 20.339347
## iter  70 value 18.971778
## iter  80 value 18.386184
## iter  90 value 18.351844
## iter 100 value 18.341811
## final  value 18.341811 
## stopped after 100 iterations
## Confusion Matrix and Statistics
## 
##                       Reference
## Prediction             Central Andes Central Pyrenees Sierra de Guadarrama
##   Central Andes                   24                0                    0
##   Central Pyrenees                 0               39                    0
##   Sierra de Guadarrama             0                0                   10
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9507, 1)
##     No Information Rate : 0.5342     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Central Andes Class: Central Pyrenees
## Sensitivity                        1.0000                  1.0000
## Specificity                        1.0000                  1.0000
## Pos Pred Value                     1.0000                  1.0000
## Neg Pred Value                     1.0000                  1.0000
## Prevalence                         0.3288                  0.5342
## Detection Rate                     0.3288                  0.5342
## Detection Prevalence               0.3288                  0.5342
## Balanced Accuracy                  1.0000                  1.0000
##                      Class: Sierra de Guadarrama
## Sensitivity                                1.000
## Specificity                                1.000
## Pos Pred Value                             1.000
## Neg Pred Value                             1.000
## Prevalence                                 0.137
## Detection Rate                             0.137
## Detection Prevalence                       0.137
## Balanced Accuracy                          1.000

KNN model (with unique bootstrapped data)

##                       Pred
## Obs                    Central Andes Central Pyrenees Sierra de Guadarrama
##   Central Andes                   24                0                    0
##   Central Pyrenees                 0               39                    0
##   Sierra de Guadarrama             0                3                    7
## [1] 0.9589041

Now the accuracy is 95.89%, lower than before, but likely more realistic thanks to the unique observations and the bootstrap.


Unsupervised learning method

Cluster Analysis

Agglomerative Hierarchical Clustering (AGNES) with Manhattan distance

##   Id Clust  variable     value
## 1 M1     1 Radiation 0.8088464
## 2 M2     1 Radiation 0.8088464
## 3 M3     1 Radiation 0.8088464
## 4 M4     1 Radiation 0.8088464
## 5 M5     1 Radiation 0.8088464
## 6 M6     2 Radiation 0.7879971

## Medoids:
##       ID   Radiation     Phos_P      Glu_P       SOC_P       NT_P        PT_P
## M141 141 -0.03267093  0.3045220  0.5926840 -0.03088487 -0.3451440 -0.49409941
## M10   10  0.45536646  0.8451473  1.0224398  0.34835885  0.2399536  4.00384317
## M115 115  0.40849005  1.0207699  0.3845950  0.93075989  0.8733244 -0.02563892
## M231 231 -0.16053254 -0.7836337 -0.4194728 -0.48312210  0.7438410 -0.21702598
## M345 345  0.61475385 -0.9278049 -0.9914553 -0.91414870 -1.2964118 -0.35445886
##              K_P       pH_P     Cond_P     Phos_B      Glu_B       SOC_B
## M141 -0.57305476 -1.0962304 -0.7126877  0.4658142  0.5925751  0.03849061
## M10  -0.22773766 -0.6933468 -0.3154310  0.5962804  0.9344154  0.12707645
## M115 -0.54962348 -1.2733935 -0.4991138  0.9836193  0.6924069  0.72350793
## M231  0.07564058  1.0649627  0.6859244 -0.5120815 -0.2591886 -0.41874832
## M345  0.31951086  0.2374789 -0.8427737 -1.0701228 -1.0747579 -0.97663211
##            NT_B        PT_B        K_B        pH_B      Cond_B     Phos_T
## M141 -0.3137321 -0.48388312 -0.6524294 -1.06489899 -0.64025123  0.2466561
## M10   0.1699264  3.98446315 -0.5489790 -0.42773019 -0.45433535  1.3444510
## M115  0.4105123  0.08044399 -0.6885907 -1.00903348 -0.08134252  1.1212807
## M231  0.4288246 -0.34792356  0.4825208  1.24068573  0.12136046 -0.6388680
## M345 -1.3550670 -0.46576956 -0.5070546 -0.01440091 -0.88139037 -1.0322060
##           Glu_T       SOC_T       NT_T       PT_T        K_T        pH_T
## M141  0.4864576 -0.09301931 -0.2889095 -0.4978368 -0.7313626 -1.03402477
## M10   1.6150132  0.70790908  0.7369885  5.7421401  0.3480502 -0.54822964
## M115  0.8822342  0.92818134  0.6674018 -0.1018243 -0.5895893 -1.20342837
## M231 -0.4456760 -0.50093012  0.3758693 -0.3097234  0.3664316  1.20662937
## M345 -1.0636267 -0.94240733 -1.3234062 -0.4174604 -0.3905646  0.07737753
##           Cond_T
## M141 -0.88320505
## M10  -0.03094146
## M115 -0.46360706
## M231  0.35777356
## M345 -1.14094739
## Clustering vector:
##   M1   M2   M3   M4   M5   M6   M7   M8   M9  M10  M11  M12  M13  M14  M15  M16 
##    1    1    1    1    1    2    2    2    2    2    3    3    3    3    3    1 
##  M17  M18  M19  M20  M21  M22  M23  M24  M25  M26  M27  M28  M29  M30  M31  M32 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    3    3 
##  M33  M34  M35  M36  M37  M38  M39  M40  M41  M42  M43  M44  M45  M46  M47  M48 
##    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
##  M49  M50  M51  M52  M53  M54  M55  M56  M57  M58  M59  M60  M61  M62  M63  M64 
##    3    3    3    3    3    3    3    1    1    1    1    1    1    1    1    1 
##  M65  M66  M67  M68  M69  M70  M71  M72  M73  M74  M75  M76  M77  M78  M79  M80 
##    1    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
##  M81  M82  M83  M84  M85  M86  M87  M88  M89  M90  M91  M92  M93  M94  M95  M96 
##    3    3    3    3    3    1    1    1    1    1    1    1    1    1    1    2 
##  M97  M98  M99 M100 M101 M102 M103 M104 M105 M106 M107 M108 M109 M110 M111 M112 
##    2    2    2    2    3    3    3    3    3    3    3    3    3    3    3    3 
## M113 M114 M115 M116 M117 M118 M119 M120 M121 M122 M123 M124 M125 M126 M127 M128 
##    3    3    3    3    3    3    3    3    1    1    1    1    1    3    3    3 
## M129 M130 M131 M132 M133 M134 M135 M136 M137 M138 M139 M140 M141 M142 M143 M144 
##    3    3    3    3    3    3    3    1    1    1    1    1    1    1    1    1 
## M145 M146 M147 M148 M149 M150 M151 M152 M153 M154 M155 M156 M157 M158 M159 M160 
##    1    1    1    1    1    1    3    3    3    3    3    1    1    1    1    1 
## M161 M162 M163 M164 M165 M166 M167 M168 M169 M170 M171 M172 M173 M174 M175 M176 
##    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    1 
## M177 M178 M179 M180 M181 M182 M183 M184 M185 M186 M187 M188 M189 M190 M191 M192 
##    1    1    1    1    3    3    3    3    3    3    3    3    3    3    3    3 
## M193 M194 M195 M196 M197 M198 M199 M200 M201 M202 M203 M204 M205 M206 M207 M208 
##    3    3    3    4    4    4    4    4    4    4    4    4    4    4    4    4 
## M209 M210 M211 M212 M213 M214 M215 M216 M217 M218 M219 M220 M221 M222 M223 M224 
##    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4 
## M225 M226 M227 M228 M229 M230 M231 M232 M233 M234 M235 M236 M237 M238 M239 M240 
##    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4 
## M241 M242 M243 M244 M245 M246 M247 M248 M249 M250 M251 M252 M253 M254 M255 M256 
##    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4 
## M257 M258 M259 M260 M261 M262 M263 M264 M265 M266 M267 M268 M269 M270 M271 M272 
##    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4 
## M273 M274 M275 M276 M277 M278 M279 M280 M281 M282 M283 M284 M285 M286 M287 M288 
##    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4 
## M289 M290 M291 M292 M293 M294 M295 M296 M297 M298 M299 M300 M301 M302 M303 M304 
##    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4    4 
## M305 M306 M307 M308 M309 M310 M311 M312 M313 M314 M315 M316 M317 M318 M319 M320 
##    4    4    5    5    4    5    4    4    4    4    4    4    4    4    4    4 
## M321 M322 M323 M324 M325 M326 M327 M328 M329 M330 M331 M332 M333 M334 M335 M336 
##    4    4    4    4    4    4    4    4    4    4    5    5    5    5    5    5 
## M337 M338 M339 M340 M341 M342 M343 M344 M345 M346 M347 M348 M349 M350 M351 M352 
##    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
## M353 M354 M355 M356 M357 M358 M359 M360 M361 M362 M363 M364 M365 M366 M367 M368 
##    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5 
## M369 M370 M371 M372 M373 M374 M375 M376 M377 M378 M379 M380 M381 M382 M383 M384 
##    5    5    5    5    5    5    5    5    5    5    5    5    5    5    1    1 
## M385 M386 M387 M388 M389 M390 M391 M392 M393 M394 M395 M396 M397 M398 M399 M400 
##    5    2    2    2    2    2    5    5    5    5    5    5    5    1    1    5 
## M401 M402 M403 M404 M405 M406 M407 M408 M409 M410 M411 M412 M413 M414 M415 M416 
##    5    5    3    5    5    5    5    5    5    5    5    5    5    5    5    5 
## M417 M418 M419 M420 M421 M422 M423 M424 M425 M426 M427 M428 M429 M430 
##    5    5    4    5    5    5    5    4    5    5    5    5    5    5 
## Objective function:
##    build     swap 
## 2.970077 2.956713 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"

Using the dendrogram with complete linkage on the Manhattan distance, together with the silhouette method, we identified k = 5 as the optimal number of clusters.
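A minimal sketch of that pipeline, assuming the cluster package and scaled numeric predictors; iris stands in for the mountain data, and all object names are hypothetical:

```r
library(cluster)

# Stand-in data: scaled numeric columns of iris
X <- scale(iris[, 1:4])

# Agglomerative clustering (AGNES) with complete linkage on Manhattan distance
d   <- dist(X, method = "manhattan")
agn <- agnes(d, method = "complete")

# Average silhouette width for k = 2..8; the optimal k maximises it
sil <- sapply(2:8, function(k)
  mean(silhouette(cutree(as.hclust(agn), k = k), d)[, "sil_width"]))
which.max(sil) + 1  # chosen number of clusters
```

The “Medoids” output above comes from a partitioning-around-medoids fit on the same data, which reports one representative observation per cluster.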

By analyzing these clusters, we can differentiate some of them. Cluster 1 is roughly average on all variables except its pH (pH_B, pH_P and pH_T), which is well below the others. Cluster 2 is distinguished by its high PT_P, PT_T and PT_B. Cluster 3 is characterized by a high pH_B, pH_P and pH_T and, like cluster 5, a low β-glucosidase enzyme content (Glu_B, Glu_P and Glu_T). Looking at the distribution of cluster 4, we see that it contains few observations but is clearly separated from the others: only one line appears where the median merges with the width of its distribution. This cluster has the lowest radiation, the highest soil organic carbon (SOC_B, SOC_T and SOC_P) and the highest soil total nitrogen (NT_T, NT_B and NT_P). Finally, the last cluster has a pH (pH_B, pH_P and pH_T) close to zero and a high radiation, and it contains the observations with the lowest soil organic carbon (SOC_B, SOC_T and SOC_P).


Limitations of our model:

The main limitations come from the dataset used to train and test the model. As we noticed during the EDA, there are many duplicate rows for the Sierra de Guadarrama region; when they are not removed, they inflate the accuracy on both the test set and the training set, since predicting identical data obviously leads to a good result. Moreover, even after removing these duplicates, the fact that the subsamples were taken close to each other prevents us from obtaining results representative of a real study with samples collected from more widely spaced areas. We therefore get a deceptively high accuracy, since the values to predict are close to those used to train the model. Ideally we would keep only one subsample per sample, but in that case we would be sorely lacking in data to train the model.